Imitation Learning via Kernel Mean Embedding
Abstract
Imitation learning refers to the problem where an agent learns a policy that mimics the demonstration provided by the expert, without any information on the cost function of the environment. Classical approaches to imitation learning usually rely on a restrictive class of cost functions that best explains the expert's demonstration, exemplified by linear functions of pre-defined features on states and actions. We show that the kernelization of a classical algorithm naturally reduces imitation learning to a distribution learning problem, where the imitation policy tries to match the state-action visitation distribution of the expert. Closely related to our approach is the recent work on leveraging generative adversarial networks (GANs) for imitation learning, but our reduction to distribution learning is much simpler, robust to scarce expert demonstration, and sample efficient. We demonstrate the effectiveness of our approach on a wide range of high-dimensional control tasks.

Introduction

In imitation learning, an agent learns to behave by mimicking the demonstration provided by the expert, situated in an environment with an unknown cost function. A classical approach to imitation learning is behavioral cloning, where a policy mapping states to actions is learned directly by supervised learning (Sammut 2010). Unfortunately, this straightforward approach does not generalize well to unseen states and often requires a large amount of training data. A more principled approach is apprenticeship learning (AL), which seeks a policy that is guaranteed to perform at least as well as the expert (Russell 1998; Ng, Russell, and others 2000; Abbeel and Ng 2004). However, to formally meet this guarantee, AL algorithms typically assume a restrictive class of cost functions and a planner that yields a sufficiently accurate optimal policy for a given cost function. This does not reflect the complex nature of high-dimensional dynamics in real-world problems.

On the other hand, deep neural networks have shown strong predictive power in modeling complex functions: a network-parameterized function is highly flexible and expressive, and can be trained efficiently by stochastic gradient descent. Representing the cost function and the agent policy using neural networks should therefore yield a plausible policy that faithfully imitates the expert's demonstrated behaviors in high-dimensional control tasks. In this line of thought, Ho and Ermon (2016) presented generative adversarial imitation learning (GAIL), which casts the objective of imitation learning as the training objective of generative adversarial networks (GANs) (Goodfellow et al. 2014). The key insight behind GAIL is that imitation learning reduces to matching the state-action visitation distribution (i.e., occupancy measure) of the learned policy to that of the expert policy, under a suitable choice of the penalty on the cost function. However, GAIL often exhibits unstable training in practice due to the alternating optimization of the generator and discriminator networks required by the minimax objective, a well-known challenge in training GANs.

In this work, we show that extending the class of cost functions to a reproducing kernel Hilbert space (RKHS) likewise reduces imitation learning to a distribution learning problem under the maximum mean discrepancy (MMD), a metric on probability distributions defined in the RKHS.
However, our derivation is much simpler and more natural. Although the derivation is almost immediate, our work is the first to present the kernelization of a classical AL algorithm (Abbeel and Ng 2004) and to establish analogies with the state-of-the-art imitation learning algorithm, i.e., GAIL. The advantage of our approach is that training becomes simpler yet robust to local optima, since the hard minimax optimization is avoided. As an end result, our work becomes closely related to generative moment matching networks (GMMNs) (Li, Swersky, and Zemel 2015) and MMD nets (Dziugaite, Roy, and Ghahramani 2015), two approaches to training deep generative neural networks using the MMD. Our experiments on the same set of high-dimensional control imitation tasks, with settings identical to the GAIL paper and with the largest task involving 376 observation and 17 action dimensions, demonstrate that the proposed approach performs better than or on a par with GAIL, and significantly outperforms GAIL when the expert demonstration is scarce, with performance gains of up to 41%.

Background

MDPs and Imitation Learning

We define basic notation for our problem setting and briefly review relevant work in the literature. We assume learning in an environment that can be modeled as a Markov decision process (MDP), with state space $S$, action space $A$, transition model $p(s'|s,a)$, initial state distribution $p_0(s)$, and cost function $c(s,a)$. We assume the total discounted cost with discount rate $\gamma$, so that the long-term cost of a policy $\pi : S \to p(A)$ is defined as $J_c(\pi) \triangleq \mathbb{E}_{s_0,a_0,\dots}\left[\sum_{t=0}^{\infty} \gamma^t c(s_t,a_t)\right]$, where the trajectory $[s_0, a_0, \dots]$ is generated by $s_0 \sim p_0(s_0)$, $s_{t+1} \sim p(s_{t+1}|s_t,a_t)$, and $a_t \sim \pi(a_t|s_t)$ for all $t \geq 0$. We also define the state-action value function $Q^{\pi}_c(s,a) \triangleq \mathbb{E}_{s_t=s,a_t=a,\dots}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} c(s_{t'},a_{t'})\right]$, the state value function $V^{\pi}_c(s) \triangleq \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}_c(s,a)\right]$, and the advantage function $A^{\pi}_c(s,a) \triangleq Q^{\pi}_c(s,a) - V^{\pi}_c(s)$.

For any policy $\pi$, there is a one-to-one correspondence between the policy and its occupancy measure (Puterman 1994)

$$\rho_\pi(s,a) \triangleq \mathbb{E}_{s_0,a_0,\dots}\left[\sum_{t=0}^{\infty} \gamma^t \delta_{s,a}(s_t,a_t)\right] = \sum_{t=0}^{\infty} \gamma^t \, p(s_t = s, a_t = a) \quad (1)$$

where $\delta$ is the Kronecker delta function. Essentially, this is an unnormalized visitation distribution over states and actions, which sums to $1/(1-\gamma)$. The long-term cost is then simply the expected cost under the occupancy measure, $J_c(\pi) = \sum_{s,a} c(s,a)\,\rho_\pi(s,a) \triangleq \mathbb{E}_{(s,a) \sim \rho_\pi}[c(s,a)]$.

When the state and action spaces are large or continuous, the classical approach is to work with pre-defined features $\vec{\phi}(s,a) \in \mathbb{R}^d$ and cost functions linear in these features, $c(s,a) = \vec{w}^{\top}\vec{\phi}(s,a)$. In this case, it is easy to see that using the feature expectation

$$\vec{\mu}_\pi \triangleq \mathbb{E}_{s_0,a_0,\dots}\left[\sum_{t=0}^{\infty} \gamma^t \vec{\phi}(s_t,a_t)\right] = \mathbb{E}_{(s,a) \sim \rho_\pi}\left[\vec{\phi}(s,a)\right], \quad (2)$$

the long-term cost is $J_c(\pi) = \vec{w}^{\top}\vec{\mu}_\pi$ by the linearity of expectation.

In imitation learning, we assume a demonstration dataset $D_{\pi_E}$ provided by the expert, whose behavior is governed by a policy $\pi_E$ unknown to the agent. The goal is to compute a policy $\pi$ that best approximates $\pi_E$ given the demonstration dataset, without any further information about the underlying MDP model, except a simulator that can sample trajectories given any policy $\pi$. Apprenticeship learning (AL) is an approach to imitation learning that formalizes this problem by defining a class of cost functions $\mathcal{C}$ and seeking a policy $\pi$ that performs as well as $\pi_E$ for all $c \in \mathcal{C}$, i.e., $J_c(\pi) \leq J_c(\pi_E)$ for all $c \in \mathcal{C}$. Note that if the true cost function is in $\mathcal{C}$, a policy $\pi$ that satisfies this inequality is guaranteed to perform as well as, or better than, $\pi_E$.
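The feature expectation in Eqn. (2) is the central quantity manipulated by the AL variants discussed next, and it can be estimated by straightforward Monte Carlo over rollouts. The following is a minimal sketch under assumed interfaces: the trajectory format, the `phi` feature map, and the function name are illustrative conventions, not part of the paper.

```python
import numpy as np

def feature_expectation(trajectories, phi, gamma=0.99):
    """Monte Carlo estimate of the discounted feature expectation in Eqn. (2).

    trajectories: list of rollouts of a policy, each a list of (state, action) pairs
    phi:          feature map phi(s, a) returning a 1-D numpy array
    Returns the average over rollouts of sum_t gamma^t * phi(s_t, a_t).
    """
    total = None
    for traj in trajectories:
        acc = None
        for t, (s, a) in enumerate(traj):
            f = (gamma ** t) * phi(s, a)
            acc = f if acc is None else acc + f
        total = acc if total is None else total + acc
    return total / len(trajectories)
```

With estimates of $\vec{\mu}_\pi$ (from simulator rollouts) and $\vec{\mu}_{\pi_E}$ (from the demonstration dataset $D_{\pi_E}$), any linear cost $c(s,a) = \vec{w}^{\top}\vec{\phi}(s,a)$ gives the long-term cost gap $\vec{w}^{\top}(\vec{\mu}_\pi - \vec{\mu}_{\pi_E})$, which is exactly the quantity the cost function classes introduced next maximize over $\vec{w}$.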
The constraint satisfaction problem above can be reformulated as an optimization problem with the objective

$$\min_\pi \psi^*_{\mathcal{C}}(\pi, \pi_E), \quad \text{where } \psi^*_{\mathcal{C}}(\pi, \pi_E) = \max_{c \in \mathcal{C}} \left[ J_c(\pi) - J_c(\pi_E) \right]. \quad (3)$$

The AL algorithm by Abbeel and Ng (2004) can be seen as choosing the cost function class

$$\mathcal{C}_{\ell_2} = \left\{ c(s,a) = \vec{w}^{\top}\vec{\phi}(s,a) \;\middle|\; \|\vec{w}\|_2 \leq 1 \right\},$$

which yields

$$\psi^*_{\mathcal{C}_{\ell_2}}(\pi, \pi_E) = \max_{\|\vec{w}\|_2 \leq 1} \left[ \vec{w}^{\top}\vec{\mu}_\pi - \vec{w}^{\top}\vec{\mu}_{\pi_E} \right], \quad (4)$$

whereas Multiplicative Weights Apprenticeship Learning (MWAL) by Syed and Schapire (2007) is associated with the cost function class

$$\mathcal{C}_{\Delta} = \left\{ \vec{w}^{\top}\vec{\phi}(s,a) \;\middle|\; \vec{w} \in \Delta \right\},$$

where $\Delta$ denotes the simplex constraint, i.e., $\vec{w} \geq \vec{0}$ and $\sum_i w_i = 1$. This yields

$$\psi^*_{\mathcal{C}_{\Delta}}(\pi, \pi_E) = \max_{\vec{w} \in \Delta} \left[ \vec{w}^{\top}\vec{\mu}_\pi - \vec{w}^{\top}\vec{\mu}_{\pi_E} \right]. \quad (5)$$

The minimax optimization problem in Eqn. (3) typically involves repeated computation of optimal policies for intermediate cost functions (Abbeel and Ng 2004; Syed and Schapire 2007). This is intractable for large-scale problems, especially those with continuous action spaces. Ho, Gupta, and Ermon (2016) presented a gradient-based stochastic optimization approach, where a parameterized policy (typically a neural network) $\pi_\theta$ is found by alternating between computing the cost function $c^*$ that attains the maximum in $\psi^*_{\mathcal{C}}$ with $\pi_\theta$ fixed, and improving $\pi_\theta$ by gradient descent using $\nabla_\theta \psi_{c^*}(\pi_\theta, \pi_E)$. This approach can be used with different cost function classes, e.g., $\mathcal{C}_{\ell_2}$ or $\mathcal{C}_{\Delta}$. In the experiments, we will refer to these two gradient-based versions of AL as Feature Expectation Matching (FEM) and Game-Theoretic Apprenticeship Learning (GTAL).

Kernel Mean Embedding and MMD

This paper seeks a kernel-based imitation learning algorithm that does not rely on explicit features $\vec{\phi}(s,a)$. The very first step is to realize that the feature expectation given in Eqn. (2) is actually the mean embedding of the distribution $\rho_\pi$ using the feature map $\vec{\phi}$. This motivates the use of kernel mean embedding, which extends the classical kernel approach to probability distributions (Smola et al. 2007). Specifically, choosing a kernel $k$ implies an implicit feature map $\phi$ that represents a probability distribution $P$ as a mean function in the RKHS, $\mu_P(\cdot) \triangleq \mathbb{E}_{x \sim P}[k(x, \cdot)]$.
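As a concrete illustration of the kernel view (not the paper's implementation), the squared MMD between the occupancy measure of the learned policy and that of the expert can be estimated directly from state-action samples. The sketch below assumes a Gaussian kernel with a hand-picked bandwidth and uses the simple biased estimator; the function names and the toy check are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of MMD^2 between samples X ~ P and Y ~ Q.

    X: (n, d) state-action samples from the learned policy's rollouts
    Y: (m, d) state-action samples from the expert demonstration
    """
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())

# Sanity check: samples from the same distribution should give a smaller MMD^2
# than samples from a shifted distribution.
rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 3)), rng.normal(size=(200, 3)))
shifted = mmd2(rng.normal(size=(200, 3)), rng.normal(size=(200, 3)) + 1.0)
print(same < shifted)  # expected: True
```

In our setting, $X$ would contain state-action pairs sampled by rolling out the current policy and $Y$ the expert's demonstrated pairs; driving this quantity down with respect to the policy parameters is the distribution matching objective described above.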